Abstract

We initiate the rigorous study of classification in semimetric spaces, which are point sets with 
a distance function that is non-negative and symmetric, but need not satisfy the triangle inequality. 
For metric spaces, the doubling dimension essentially characterizes both the runtime and sample 
complexity of classification algorithms — yet we show that this is not the case for semimetrics. 
Instead, we define the density dimension and discover that it plays a central role in the statistical 
and algorithmic feasibility of learning in semimetric spaces. We present nearly optimal sample 
compression algorithms and use these to obtain generalization guarantees, including fast rates. 
The latter hold for general sample compression schemes and may be of independent interest. 


1 Introduction 

The problem of learning in non-metric spaces has been of significant recent interest, being the subject 
of a 2010 COLT workshop and a central topic of all three SIMBAD conferences. In this paper, we 
initiate the study of efficient statistical learning in semimetric spaces, which are point sets endowed 



rigorous learning results in semimetric spaces prior to this work. 


Background and motivation. Much of the existing machinery for classification algorithms, as well 
as generalization bounds, depends strongly on the data residing in a Hilbert space. For some important 
applications, this structural constraint severely limits the applicability of existing methods. Indeed, it 
is often the case that the data is naturally endowed with some metric strongly dissimilar to the familiar 
Euclidean norm. 

Consider images, for example. Although these can be naively represented as coordinate-vectors
in $\mathbb{R}^d$, the Euclidean (or even $\ell_p$) distance between the representative vectors does not correspond
well to the one perceived by human vision. Instead, the earthmover distance is commonly used in
vision applications [Rubner et al., 2000]. Yet representing earthmover distances using any fixed $\ell_p$


1 Some authors use the term “semimetric” to mean pseudometrics. These preserve much of the structure of metrics, the 
only difference being that they allow distinct points to have distance 0. Our usage appears to be the standard one. 
norm unavoidably introduces very large inter-point distortion [Naor and Schechtman, 2007], potentially
corrupting the data geometry before the learning process has even begun. Nor is this issue
mitigated by kernelization, as kernels necessarily embed the data in a Hilbert space, again incurring
the aforementioned distortion. A similar issue arises for strings: these can be naively treated as vectors
endowed with different $\ell_p$ metrics, but a much more natural metric over strings is the edit distance,
which is similarly known to be strongly non-Euclidean [Andoni and Krauthgamer, 2010]. Additional
limitations of kernel methods are articulated in Balcan et al. [2008b].

These concerns have led researchers to seek out algorithmic and statistical approaches that apply
in greater generality. A particularly fruitful recent direction has focused on metric spaces. Metric
spaces are point sets endowed with a distance function that is non-negative and symmetric, and also
satisfies the triangle inequality. Since metric spaces may be highly complex (for example, they include
infinite-dimensional Hilbert spaces), the discussion is typically restricted to metric spaces with
bounded intrinsic dimension. The latter may be formalized, e.g., via metric entropy numbers or the
doubling dimension. This paradigm captures some natural distance metrics, such as earthmover and
edit distances [Gottlieb et al., 2014a].

Assuming no additional structure beyond inter-point distances, one is left (almost tautologically)
with proximity-based methods, and all the learning algorithms considered in this paper
will be variants of the Nearest Neighbor classifier. For metric spaces, it is known that a sam-
], and that exponential dependence on ddim is in general unavoidable
[Shalev-Shwartz and Ben-David, 2014].

As for algorithmic runtimes, the naive nearest-neighbor classifier evaluates queries in $O(n)$ time
(where $n$ is the sample size); however, an approximate nearest
neighbor can be found in time $2^{O(\mathrm{ddim})} \log n$. If one desires runtimes depending not on $n$ but on
the geometry (say, margin $\gamma$) of the data, one may achieve a sample compression scheme of size
$\gamma^{-O(\mathrm{ddim})}$, and it is NP-hard to achieve a significantly better compression [Gottlieb et al., 2014b].
Hence, the doubling dimension in some sense characterizes the statistical and computational difficulty
of learning in metric spaces. We note that all learning bounds and algorithms for doubling spaces rely
on the packing property for these spaces (Lemma 1), which upper-bounds the size of a point set whose
inter-point distance is bounded from below.

While metric spaces are significantly more general than Hilbertian ones, they still do not capture
many common distance functions used by practitioners. These non-metric distances include
the Jensen-Shannon divergence, which appears in statistical applications [Fuglede and Topsøe, 2004,
Goodfellow et al., 2014], and $k$-median Hausdorff distances and $\ell_p$ distances with $0 < p < 1$, which
appear in vision applications [Dubuisson and Jain, 1994, Jacobs et al., 2000], all of which are
semimetrics. An additional line of work by Dubuisson and Jain [1994], Jacobs et al. [2000, 1998] and
Weinshall et al. [1998] underscored the effectiveness of non-metric distances in various applications
(mainly vision), and among these, semimetrics again play a prominent role [Basri et al.].

Main results. We initiate the rigorous study of classification for semimetric spaces.

Our first contribution is a fundamental insight into semimetric spaces. Unlike in metric spaces,
where the covering numbers $\mathcal{N}(\cdot)$ and the packing numbers $\mathcal{M}(\cdot)$ are related via
$\mathcal{M}(2\varepsilon) \le \mathcal{N}(\varepsilon) \le \mathcal{M}(\varepsilon)$ (see e.g., Alon et al. [1997]), violating the triangle inequality breaks this connection between
covering and packing. Particularly, for semimetrics, a doubling constant (while well-defined) does not
imply a packing property (Lemma 2). As a consequence, the bounds in the host of results constituting
the theory of learning in doubling metric spaces are not applicable to semimetrics. Crucially, however,
we show that semimetrics with a finite density constant do obey a packing property (Lemma 2), and
so the latter serves as a natural basis for statistical and algorithmic bounds for classification in these
spaces. This insight is developed further in Lemma 3. While for metric spaces the doubling and
density constants are never very far apart, in semimetric spaces the gap may be arbitrarily large.

However, the above discussion does not imply that learning results for metric spaces are automatically
portable into semimetrics simply by replacing the doubling constant by the density constant. For
example, although the nearest-neighbor classifier is still well-defined in semimetric spaces, and may
naively be evaluated on queries in $O(n)$ time, relaxing to approximate nearest neighbors no longer
provides the exponential speedup that it does in metric spaces (Lemma 6). Simply put, without the
triangle inequality, the hierarchy-based search methods, such as Beygelzimer et al. [2006], Gottlieb et al.
[2010] and related approaches, all break down.


Fortunately, there is a technique that survives violations of the triangle inequality, namely,
sample compression. The latter is achieved by extracting a $\gamma$-net, where $\gamma$ is the sample margin
(Theorem 7). This can be done in runtime $\min\{n^2, n(1/\gamma)^{O(\mathrm{dens})}\}$, where dens is the density dimension
defined in (2); this is worse than the corresponding state of the art for metric spaces (Lemma 4). The
net-extraction procedure in effect compresses the sample from size $n$ to $(1/\gamma)^{O(\mathrm{dens})}$, which is nearly
optimal unless P=NP (Theorem 8).

On the statistical front, we give a compression-based generalization bound that smoothly interpolates
between the consistent $O(1/n)$ and agnostic $O(1/\sqrt{n})$ decay regimes (Theorem 11). This
“fast rate” holds for general compression schemes and may be of independent interest. Applied to
margin-based semimetric sample-compression schemes, it yields the bound in Theorem 13, which
is amenable to efficient Structural Risk Minimization (Theorem 9) and cannot be substantially improved
unless P=NP (Theorem 8). The lower bound in Theorem 14 shows that even under margin
assumptions, there exist adversarial distributions forcing the sample complexity to be exponential in
dens.


Related work. In a series of papers, Balcan and Blum [2006], Balcan et al. [2008c,a,b] developed
a theory of learning with similarity functions, which resemble kernels but relax the requirement of
being positive definite. Learning is accomplished by embedding the data into an appropriate Euclidean
space and performing large-margin separation. Hence, this approach effectively extracts the implicit
Euclidean structure encoded in the similarity function, but does not seem well-suited for inherently
non-Euclidean data. Wang et al. [2007] extended this framework to dissimilarity functions, obtaining
analogous results.


2 Preliminaries 

Semimetric spaces. Throughout this paper, our instance space $\mathcal{X}$ will be endowed with a semimetric
$\rho : \mathcal{X} \times \mathcal{X} \to [0, \infty)$, which is a non-negative symmetric function verifying
$\rho(x, x') = 0 \iff x = x'$ for all $x, x' \in \mathcal{X}$. If the semimetric space $(\mathcal{X}, \rho)$ additionally satisfies the triangle inequality,
$\rho(x, x') \le \rho(x, x'') + \rho(x'', x')$ for all $x, x', x'' \in \mathcal{X}$, then $\rho$ is a metric. The distance between two
sets $A, B$ in a semimetric space is defined by $\rho(A, B) = \inf_{x \in A,\, x' \in B} \rho(x, x')$. For $x \in \mathcal{X}$ and $r > 0$,
denote by $B_r(x) = \{y \in \mathcal{X} : \rho(x, y) < r\}$ the open $r$-ball about $x$. The radius of a set is the radius
of the smallest ball containing it: $\mathrm{rad}(A) = \inf\{r > 0 : \exists x \in A,\ A \subseteq B_r(x)\}$.
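
As a small illustration of these definitions, the following Python sketch checks the semimetric axioms on a finite point set and tests the triangle inequality separately. The helper names (`is_semimetric`, `satisfies_triangle`) and the choice of squared distance on the real line as the example are assumptions made here for illustration; they do not appear in the paper.

```python
from itertools import product

def is_semimetric(points, rho, tol=1e-12):
    """Check non-negativity, symmetry, and rho(x, y) = 0 <=> x = y on a finite set."""
    for x, y in product(points, repeat=2):
        d = rho(x, y)
        if d < 0 or abs(d - rho(y, x)) > tol:
            return False
        if (d <= tol) != (x == y):
            return False
    return True

def satisfies_triangle(points, rho, tol=1e-12):
    """Check rho(x, y) <= rho(x, z) + rho(z, y) over all triples."""
    for x, y, z in product(points, repeat=3):
        if rho(x, y) > rho(x, z) + rho(z, y) + tol:
            return False
    return True

def sq_dist(x, y):
    """Squared distance on the line: a semimetric that is not a metric."""
    return (x - y) ** 2

points = [0.0, 1.0, 2.0]
print(is_semimetric(points, sq_dist))       # axioms hold
print(satisfies_triangle(points, sq_dist))  # fails: rho(0, 2) = 4 > 1 + 1
```

On the three points above, the squared distance passes the semimetric check while violating the triangle inequality on the triple (0, 2, 1).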


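Doubling and density constants. Let $\lambda = \lambda(\mathcal{X})$ be the smallest number such that every open ball
in $\mathcal{X}$ can be covered by $\lambda$ open balls of half the radius, where all balls are centered at points of $\mathcal{X}$.
Formally,

$$\lambda(\mathcal{X}) = \min\Big\{\lambda \in \mathbb{N} : \forall x \in \mathcal{X},\ r > 0\ \ \exists x_1, \ldots, x_\lambda \in \mathcal{X} : B_r(x) \subseteq \bigcup_{i=1}^{\lambda} B_{r/2}(x_i)\Big\}.$$

Then $\lambda$ is the doubling constant of $\mathcal{X}$, and the doubling dimension of $\mathcal{X}$ is $\mathrm{ddim}(\mathcal{X}) = \log_2 \lambda$.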

An $r$-net of a set $A \subseteq \mathcal{X}$ is any maximal subset of $A$ having mutual inter-point distance at least $r$.
The $r$-packing number $M(r, A)$ of $A$ is the maximum size of any $r$-net of $A$:

$$M(r, A) = \max\{|E| : E \subseteq A,\ (x, y \in E) \wedge (x \neq y) \Rightarrow \rho(x, y) \ge r\}. \tag{1}$$


Gottlieb and Krauthgamer [2013] defined the density constant $\mu(\mathcal{X})$ as the smallest number such
that any open $r$-radius ball in $\mathcal{X}$ contains at most $\mu$ points at mutual inter-point distance at least $r/2$:

$$\mu(\mathcal{X}) = \min\Big\{\mu \in \mathbb{N} : (x \in \mathcal{X}) \wedge (r > 0) \Rightarrow M\big(\tfrac{r}{2}, B_r(x)\big) \le \mu\Big\}, \tag{2}$$

and we define the density dimension of $\mathcal{X}$ by $\mathrm{dens}(\mathcal{X}) = \log_2 \mu(\mathcal{X})$.


Learning model. We work in the standard agnostic learning model [Mohri et al., 2012,
Shalev-Shwartz and Ben-David, 2014], whereby the learner receives a sample $S$ consisting of $n$ labeled
examples $(X_i, Y_i)$, drawn iid from an unknown distribution over $\mathcal{X} \times \{-1, 1\}$. All subsequent
probabilities and expectations will be with respect to this distribution. Based on the training sample
$S$, the learner produces a hypothesis $h : \mathcal{X} \to \{-1, 1\}$, whose empirical error is defined by
$\widehat{\mathrm{err}}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{h(X_i) \neq Y_i\}}$ and whose generalization error is defined by $\mathrm{err}(h) = \mathbb{P}(h(X) \neq Y)$.


Sub-sample, margin, and induced 1-NN. In a slight abuse of notation, we will blur the distinction
between $S \subset \mathcal{X}$ as a collection of points in a semimetric space and $S \in (\mathcal{X} \times \{-1, 1\})^n$ as a
sequence of labeled examples. Thus, the notion of a sub-sample $\tilde{S} \subseteq S$ partitioned into its positively
and negatively labeled subsets as $\tilde{S} = \tilde{S}_+ \cup \tilde{S}_-$ is well-defined. The margin of $\tilde{S}$, defined by

$$\mathrm{marg}(\tilde{S}) = \rho(\tilde{S}_+, \tilde{S}_-), \tag{3}$$

is the minimum distance between a pair of opposite-labeled points (see the figure in the Appendix). In
degenerate cases where one of $\tilde{S}_+, \tilde{S}_-$ is empty, $\mathrm{marg}(\tilde{S}) = \infty$. A sub-sample $\tilde{S}$ naturally induces
the 1-NN classifier $h_{\tilde{S}}$, via

$$h_{\tilde{S}}(x) = \mathrm{sign}\big(\rho(x, \tilde{S}_-) - \rho(x, \tilde{S}_+)\big). \tag{4}$$
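
The margin (3) and the induced 1-NN rule (4) translate directly into code. The sketch below is a minimal illustration under our own assumptions: a sample is a list of `(point, label)` pairs with labels in {-1, +1}, the function names are hypothetical, and ties in (4), where the sign is zero, are broken toward -1, a convention the paper does not specify.

```python
import math

def set_dist(x, pts, rho):
    """rho(x, A) = min over a in A; infinity when A is empty."""
    return min((rho(x, a) for a in pts), default=math.inf)

def margin(sample, rho):
    """Equation (3): smallest distance between oppositely labeled points."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == -1]
    return min((rho(p, q) for p in pos for q in neg), default=math.inf)

def induced_1nn(subsample, rho):
    """Equation (4): classifier induced by a labeled sub-sample."""
    pos = [x for x, y in subsample if y == 1]
    neg = [x for x, y in subsample if y == -1]

    def h(x):
        diff = set_dist(x, neg, rho) - set_dist(x, pos, rho)
        return 1 if diff > 0 else -1  # tie-breaking toward -1 is an assumption

    return h

rho = lambda x, y: (x - y) ** 2  # squared distance: a semimetric on the line
sample = [(0.0, 1), (1.0, 1), (3.0, -1), (4.0, -1)]
h = induced_1nn(sample, rho)
print(margin(sample, rho), h(1.5), h(3.5))
```

Computing the margin this way takes time quadratic in the sample size, since every opposite-labeled pair is inspected.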


The problem of nearest-neighbor condensing is to produce the minimal subsample $\tilde{S} \subseteq S$ so that
the 1-NN classifier is consistent with $S$, i.e. has zero training error. This problem was considered by
Gottlieb et al. [2014b] in the context of doubling metric spaces, where they demonstrated that it is NP-hard
to find the minimal $\tilde{S}$, even approximately (to within a factor
$2^{O((\mathrm{ddim}(S) \log(2\,\mathrm{rad}(S)/\mathrm{marg}(S)))^{1-o(1)})}$ of the minimum size). This result translates immediately to the more general semimetric spaces.
3 Metric vs. Semimetric spaces


In this section, we consider the basic tools used in learning algorithms for doubling metric spaces.
We show that in semimetric spaces, low doubling dimension does not imply a low packing number
(Lemma 2). Hence, all learning algorithms developed for metric spaces relying on the doubling dimension
are no longer efficient in semimetric spaces. We then show that a low density constant does
imply a low packing number, even for semimetric spaces. An even more stark distinction is established:
in doubling metric spaces, the doubling and density constants are never very far apart, while
in semimetric spaces the gap may be arbitrarily large.

These results suggest that the semimetric density constant will play the role of the metric doubling
constant. This intuition is borne out in some aspects (Lemma 4) and proves to be spurious in others
(Lemma 6). When controlling for both constants, approximate nearest-neighbor search in semimetric
spaces cannot be performed nearly as efficiently as in doubling metric spaces.

The results presented in this section serve as the theoretical basis motivating our learning algorithms
(Section 5).


3.1 Doubling constant vs. the density constant 


The following lemma states the well-known packing property of doubling spaces (see for example
Krauthgamer and Lee [2004]). It is a basic component of all the ddim-based proximity methods.
Note the use of the triangle inequality in the proof.

Lemma 1. If $\mathcal{X}$ is a metric space and $C \subseteq \mathcal{X}$ has minimum inter-point distance $b$, then
$|C| \le (2\,\mathrm{rad}(\mathcal{X})/b)^{O(\mathrm{ddim}(\mathcal{X}))}$.

Proof. $C$ can be covered by $|C|$ open balls of radius $b$ centered at the points of $C$. By repeatedly
applying the definition of the doubling constant, $C$ (and in fact all of $\mathcal{X}$) can be covered by
$k = \lambda(\mathcal{X})^{O(\log(\mathrm{rad}(\mathcal{X})/b))} = (2\,\mathrm{rad}(\mathcal{X})/b)^{O(\mathrm{ddim}(\mathcal{X}))}$ balls of radius $b/2$ centered at points of $\mathcal{X}$. By the triangle
inequality, each of these $b/2$-radius balls is completely contained in some $b$-radius ball centered at points
of $C$, hence $|C| \le k$. □


The central contribution of this section is the following lemma. It demonstrates that for semimetrics,
a doubling property does not imply a packing property (unlike for metrics, Lemma 1). However,
a finite density constant does imply a packing property.

Lemma 2. In semimetric spaces, the doubling constant does not imply a packing property, while the
density constant does. In particular,

(a) There exist semimetric spaces $\mathcal{X}$ of arbitrary cardinality with a universally bounded doubling
constant $\lambda(\mathcal{X}) = O(1)$, such that $\mathcal{X}$ contains a $\mathrm{rad}(\mathcal{X})$-net $C$ of size $\Theta(|\mathcal{X}|)$.

(b) For any semimetric space $\mathcal{X}$ and $b > 0$, the size of any $b$-net of $\mathcal{X}$ is
$(2\,\mathrm{rad}(\mathcal{X})/b)^{O(\mathrm{dens}(\mathcal{X}))}$.
Proof. (a). Let $\mathcal{X}$ be composed of two sets, $A$ and $A'$. Put $A = \{a_1, \ldots, a_n\}$, endowed with the line
metric $\rho(a_i, a_j) = |i - j|$, so the maximum distance in $A$ is $n - 1$. Note that $\lambda(A) = O(1)$. Define
$A' = \{a'_1, \ldots, a'_n\}$ to consist of $n$ points, such that

$$\rho(a'_i, a_j) = \rho(a_i, a_j) + \phi \cdot \mathbb{1}_{\{i = j\}} \qquad (\phi > 0 \text{ infinitesimal}),$$

while $\rho(a'_i, a'_j) = n - 1$ for $i \neq j$. This defines a semimetric on $\mathcal{X}$.

Clearly, $A'$ forms a $\mathrm{rad}(\mathcal{X})$-net of size $|\mathcal{X}|/2$, and yet we can show that $\lambda(\mathcal{X}) = O(1)$. Indeed,
consider any ball $B_r(x)$ in $\mathcal{X}$. Then all points in $B_r(x)$ can be covered by the same $\lambda(A) = O(1)$
balls of radius $r/2$ that cover $A \cap B_r(x)$. The claim follows.

(b). Let $C$ be a $b$-net of $\mathcal{X}$, and suppose the radius of $\mathcal{X}$ is $R$. Partition $\mathcal{X}$ into clusters by extracting from $\mathcal{X}$ an arbitrary
net $D$ with minimum inter-point distance $R/2$, and assigning each point $p \in \mathcal{X}$ to a cluster centered
at the nearest neighbor of $p$ in $D$. Then apply the procedure recursively to each cluster (halving
the previous radius), until reaching point sets with minimum inter-point distance at least $b$. Clearly,
an appropriate choice of the subsets can yield a final set containing $C$. For example, the first set
may contain all points in the $R/2$-net of $C$, the second all points in the $R/4$-net of $C$, etc. By
repeatedly applying the definition of the density constant, the size of the final set is bounded by
$\mu(\mathcal{X})^{\log_2(2\,\mathrm{rad}(\mathcal{X})/b)} = (2\,\mathrm{rad}(\mathcal{X})/b)^{O(\mathrm{dens}(\mathcal{X}))}$, and this bounds $|C|$ as well. □
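
The construction in part (a) of the proof can be instantiated numerically. The sketch below is an illustration under stated assumptions: the infinitesimal offset is replaced by a small positive constant `phi`, points of A and A' are encoded as tagged pairs, and the function name `lemma2a_space` is hypothetical. It builds the space and confirms that A' consists of half the points at mutual distance n - 1, while the triangle inequality fails.

```python
def lemma2a_space(n, phi=1e-9):
    """Build A (line metric), A' (all mutual distances n - 1), and rho on their union.

    ('a', i) encodes a_i and ('b', i) encodes a'_i; phi stands in for the
    infinitesimal offset of the proof (an assumption of this sketch).
    """
    A = [("a", i) for i in range(1, n + 1)]
    A_prime = [("b", i) for i in range(1, n + 1)]

    def rho(u, v):
        if u == v:
            return 0.0
        (su, i), (sv, j) = u, v
        if su == "a" and sv == "a":
            return float(abs(i - j))
        if su == "b" and sv == "b":
            return float(n - 1)
        # One point in A and one in A': line distance, plus phi when i == j.
        return abs(i - j) + (phi if i == j else 0.0)

    return A, A_prime, rho

A, A_prime, rho = lemma2a_space(6)
packing_ok = all(rho(u, v) == 5.0 for u in A_prime for v in A_prime if u != v)
violation = rho(A_prime[0], A_prime[1]) > rho(A_prime[0], A[1]) + rho(A[1], A_prime[1])
print(packing_ok, violation)
```

The triangle violation is what allows the large packing to coexist with a bounded doubling constant: each point of A' sits next to its twin in A, yet far from every other point of A'.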


In fact, a deeper principle underlies the results above: In metric spaces, the doubling and density
constants are almost the same, while in semimetric spaces there may be a large gap between them. This
is captured in the following lemma, which delineates the relationship between the doubling constant
and density constant. (The first half of the lemma is due to Gottlieb and Krauthgamer [2013].)


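Lemma 3. Let $\mathcal{X}$ be a point set endowed with a metric distance function. Then

(a) $\lambda(\mathcal{X}) \le \mu(\mathcal{X})$,

(b) $\sqrt{\mu(\mathcal{X})} \le \lambda(\mathcal{X})$.

Let $\mathcal{Y}$ be a point set endowed with a semimetric distance function. Then

(c) $\lambda(\mathcal{Y}) \le \mu(\mathcal{Y})$,

(d) $\mu(\mathcal{Y})$ may be as large as $\Theta(|\mathcal{Y}|)$, even when $\lambda(\mathcal{Y}) = O(1)$.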

Proof. To prove (a) and (c), that $\lambda \le \mu$: Consider any open ball $B_r(x) \subseteq \mathcal{X}$. Let $C$ be a maximal
collection of points in $B_r(x)$ at mutual inter-point distance at least $r/2$, and note that by definition $|C| \le \mu(\mathcal{X})$.
By the maximality of $C$, $|C|$ balls of radius $r/2$ centered at points of $C$ cover all of $B_r(x)$, so
$\lambda(\mathcal{X}) \le |C| \le \mu(\mathcal{X})$. For (b): again, consider any open ball $B_r(x) \subseteq \mathcal{X}$, and let $C$ be a maximal collection of
points in $B_r(x)$ at mutual inter-point distance at least $r/2$. Now, by definition $B_r(x)$ may be covered by $\lambda(\mathcal{X})$ balls
of radius $r/2$, and each of these smaller balls may be covered by $\lambda(\mathcal{X})$ balls of radius $r/4$, so there exists
a set of $\lambda^2(\mathcal{X})$ balls of radius $r/4$ covering all of $B_r(x)$, and in particular $C$. By the triangle inequality,
each ball of radius $r/4$ can cover at most one point of $C$, and so $|C| \le \lambda^2(\mathcal{X})$. Finally, (d) follows
immediately from Lemma 2. □


4 Basic constructions and the density constant 

Before presenting our classification algorithms in Section 5, we will show how to execute two basic
constructions, $r$-net and nearest neighbor search, for semimetrics with finite density constant.
These results are strictly worse than the corresponding state of the art for metric spaces.
Net extraction and condensing. In Lemma 2 above, we bounded the $r$-packing number of semimetric
spaces, which in turn bounds the size of the largest $r$-net of the space. For a metric set $S$, it is known
how to extract an $r$-net in time $2^{O(\mathrm{ddim}(S))} |S| \min\{\log(\mathrm{rad}(S)/r), \log |S|\}$ [Krauthgamer and Lee,
2004, Har-Peled and Mendel, 2006, Cole and Gottlieb, 2006]. The following result holds for semimetric
spaces.

Lemma 4. Given a set $S$ equipped with a semimetric distance function, an $r$-net of $S$ of size

$$k = \mu(S)^{\log_2(2\,\mathrm{rad}(S)/r)} = \left(\frac{2\,\mathrm{rad}(S)}{r}\right)^{O(\mathrm{dens}(S))}$$

can be extracted in time $O(k|S|)$.

Proof. We greedily build an $r$-net for $S$. Initialize set $C = \emptyset$, and for every point in $S$, add it to $C$ if
its closest neighbor in $C$ is at distance $r$ or greater. By Lemma 2, $|C| \le k$, and so the total runtime is
$O(k|S|)$. See Algorithm 1 in the Appendix. □
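
The greedy procedure in the proof of Lemma 4 can be sketched in a few lines of Python. This is a minimal illustration, not the Appendix algorithm itself; the function name `greedy_net` and the example data are assumptions. Each point is compared only against the current net, which gives the O(k|S|) count of distance evaluations, and no step relies on the triangle inequality.

```python
def greedy_net(points, rho, r):
    """Greedily extract an r-net: keep a point iff it is at distance >= r
    from every point kept so far."""
    net = []
    for x in points:
        if all(rho(x, c) >= r for c in net):
            net.append(x)
    return net

# Example on the integers 0..9 under squared distance (a semimetric).
rho = lambda x, y: (x - y) ** 2
points = list(range(10))
net = greedy_net(points, rho, r=4)
print(net)  # [0, 2, 4, 6, 8]
```

By construction the returned set is maximal: any point left out was within distance less than r of some net point at the moment it was examined.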


Nearest neighbor search. Finally, we juxtapose the time bounds for nearest neighbor search in 
metric and semimetric spaces. In metric spaces, the following bounds on exact and approximate 
nearest neighbor search are well-known (the proof is deferred to the Appendix): 

Lemma 5. Given a point set $S$ equipped with a metric distance function, and a query point $x$:

(a) Locating the exact nearest neighbor of $x$ in $S$ requires $\Theta(|S|)$ comparisons in the worst case.

(b) A $(1 + \varepsilon)$-approximate nearest neighbor of $x$ in $S$ can be found in time
$2^{O(\mathrm{ddim}(S))} \log |S| + \varepsilon^{-O(\mathrm{ddim}(S))}$.

For semimetric spaces, we demonstrate that the situation is much worse: 

Lemma 6. Given a point set $S$ equipped with a semimetric distance function, discovering an exact or
approximate nearest neighbor requires $\Theta(|S|)$ comparisons in the worst case.

Proof. For the upper bound, trivially $O(|S|)$ time is sufficient to consider every point in $S$.

For the lower bound, suppose the query point $q$ is at an infinitesimally small distance from a single
point $s_0 \in S$, and at distance $2\,\mathrm{rad}(S)$ from all other points of $S$. Then $s_0$ can be any point in $S$, and
cannot be located without inspecting each point: Without the triangle inequality, the distance between
one pair of points has no bearing on any other distance. □


5 Classification algorithms 

In this section, we present a classification algorithm for semimetric spaces. For a labeled sample $S$,
recall that the margin of $S$ is the minimum distance between oppositely labelled points in $S$, as defined
formally in (3). The margin of a given sample can be computed in time $O(|S|^2)$ by considering all
pairs of points.

We consider the problems of producing both consistent and inconsistent 1-NN classifiers for the
sample (see Section 2). We begin with a consistent classifier.
Theorem 7. Let $S$ be a sample set equipped with a semimetric distance function, and let the margin
$\gamma$ of $S$ be given. In time $O(k|S|)$ we can construct a nearest-neighbor classifier that achieves zero
training error on $S$, where $k = (2\,\mathrm{rad}(S)/\gamma)^{O(\mathrm{dens}(S))}$. The evaluation time for a test point is $O(k)$, and
with probability $1 - \delta$, the resulting classifier has generalization error $O\!\left(\frac{k \log n + \log \frac{1}{\delta}}{n}\right)$.

Proof. We build a $\gamma$-net $C$ for $S$ in time $O(k|S|)$, as in Lemma 4. Since $\gamma$ is the margin, by construction
every point in $S$ has the same label as its nearest neighbor in $C$, and so the nearest neighbor
classifier with regards to $C$ has zero sample error.

Given a test point $x$, we assign it the same label as its nearest neighbor in $C$. By Lemma 6, $\Theta(k)$
operations are necessary and sufficient to locate the nearest neighbor. The generalization bounds
follow from Theorem 10(i). □
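
The construction in the proof of Theorem 7 can be sketched end to end: compute the margin, extract a labeled net at that scale greedily, and classify by linear scan over the net. The code below is an illustrative sketch; the function names and the toy sample are assumptions, and the sample is assumed to contain no point carrying both labels.

```python
import math

def compress_consistent(sample, rho):
    """Return (gamma, net), where gamma is the sample margin and net is a
    labeled gamma-net of the sample extracted greedily.

    sample: list of (point, label) pairs with labels in {-1, +1}.
    """
    gamma = min(
        (rho(x, xp) for x, y in sample for xp, yp in sample if y != yp),
        default=math.inf,
    )
    net = []
    for x, y in sample:
        if all(rho(x, c) >= gamma for c, _ in net):
            net.append((x, y))
    return gamma, net

def classify(net, rho, x):
    """Label of the nearest net point, found by a linear scan over the net."""
    return min(net, key=lambda cy: rho(x, cy[0]))[1]

rho = lambda x, y: (x - y) ** 2
sample = [(0, 1), (1, 1), (2, 1), (5, -1), (6, -1), (7, -1)]
gamma, net = compress_consistent(sample, rho)
consistent = all(classify(net, rho, x) == y for x, y in sample)
print(gamma, net, consistent)
```

Consistency follows the argument of the proof: every sample point lies within distance less than the margin of some net point, so its nearest net point cannot carry the opposite label.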

The procedure in Theorem 7 compresses $S$, producing a consistent sub-sample $C$. Immediate
from the theorem is that the smaller the compressed set $C$, the better the generalization
bounds of the classifier. However, as Gottlieb et al. [2014b] recently demonstrated, even in metric
spaces, it is NP-hard to approximate the size of the minimum consistent subset to within a factor
$2^{O((\mathrm{ddim}(S) \log(2\,\mathrm{rad}(S)/\mathrm{marg}(S)))^{1-o(1)})} = 2^{O((\mathrm{dens}(S) \log(2\,\mathrm{rad}(S)/\mathrm{marg}(S)))^{1-o(1)})}$ (where the equality follows
from Lemma 3). This means that choosing the net of Lemma 4 is close to the optimal construction
for a consistent subset of $S$.

It is natural to ask whether allowing the classifier nonzero sample error results in improved generalization
bounds. This is indeed generally the case, as the bound in Theorem 11 indicates. Optimizing
this bound is an instance of Structural Risk Minimization (SRM). Unfortunately, we can show SRM
to be infeasible for this problem:

Theorem 8. Given a set $S$ equipped with a metric or semimetric distance function, let $S^* \subseteq S$ be
a sub-sample for which the generalization bound $Q(d, \varepsilon)$ in Theorem 11 (for a fixed constant $\delta$) is
minimized. Then it is NP-hard to compute any subset of $S$ achieving a generalization bound within
factor $2^{O((\mathrm{dens}(S) \log(2\,\mathrm{rad}(S)/\mathrm{marg}(S)))^{1-o(1)})}$ of the generalization bound induced by $S^*$.

Proof. The proof is via reduction from the minimum consistent subset problem, which was shown
by Gottlieb et al. [2014b] to be hard to approximate. Fix the confidence level $\delta$ in the bound, let $T$
be an instance of the minimum consistent subset problem, and put $m = |T|$. For some large value
$p$, replace each point $t_i \in T$ with a set of $p$ points $s_{i,1}, \ldots, s_{i,p}$ obeying the line metric, so that
$\rho(s_{i,a}, s_{i,b}) = \phi|a - b|$ for an infinitesimally small $\phi$. Put $\rho(s_{i,a}, s_{j,b}) = \rho(t_i, t_j)$. The new set is $S$,
with $n = |S| = pm$.

Consider a subset $S' \subseteq S$. If the 1-NN rule on $S'$ misclassifies a point of $S$, say $s_{i,a}$, then in
fact it must misclassify all $p$ points $s_{i,b}$, $b \in [1, p]$. So an inconsistent subset of $S$ achieves a value of
$Q(|S'|, p/n) = \Omega(p/n)$ in the generalization bound.

Now consider the consistent subset of $S$ consisting of $m = n/p$ points $s_{i,p}$ for $i \in [1, m]$. This
classifier achieves a generalization bound of $O\!\left(\frac{m \log n}{n}\right) = O\!\left(\frac{\log n}{p}\right)$. So when $p = \Omega(\sqrt{n \log n})$,
this consistent classifier is better than any inconsistent classifier, and by increasing $p$ we can amplify
this gap arbitrarily. Now a consistent subset of size $d \le m$ has generalization bound $O\!\left(\frac{d \log n}{n}\right)$.
As it is NP-hard to find a subset whose size is within a factor $2^{O((\mathrm{dens}(S) \log(2\,\mathrm{rad}(S)/\mathrm{marg}(S)))^{1-o(1)})}$
of the smallest consistent subset, it is NP-hard to find a consistent subset with generalization bound
within a factor $2^{O((\mathrm{dens}(S) \log(2\,\mathrm{rad}(S)/\mathrm{marg}(S)))^{1-o(1)})}$ of the optimal consistent subset, and the theorem
follows. □

















Let us turn our attention to the margin-based generalization bound provided by Theorem 13. As
before, we wish to perform SRM for this bound. Fortunately, we are able to compute the latter exactly
in polynomial time, and even more efficiently if we are willing to settle for a solution within a constant
factor of the optimal:

Theorem 9. Given a sample set $S$ equipped with a semimetric:

(a) A nearest-neighbor classifier minimizing the generalization bound of Theorem 13 can be computed
in randomized time $O(|S|^{4.373})$.

(b) A nearest-neighbor classifier whose generalization bound is within factor 2 of optimal can be
computed in deterministic time $O(|S|^2 \log |S|)$.

Each of these classifiers can be evaluated on test points in time $\left(\frac{2\,\mathrm{rad}(S)}{\gamma}\right)^{O(\mathrm{dens}(S))}$, where $\gamma$ is the
margin imposed by the SRM procedure.


Proof. For each of these solutions, we enumerate and sort in increasing order the distances between all
oppositely labelled point pairs in $S$, in total time $O(|S|^2 \log |S|)$. Each distance constitutes a separate
guess for the optimal margin to “impose” on $S$. That is, for each distance $\gamma$, we will remove from $S$
some points to ensure that all opposite labelled pairs are more than $\gamma$ far apart.

To accomplish this, we iteratively build a new graph $G$. We initialize $G$ with vertices representing
the points of $S$. At each round we add to $G$ an edge between the next closest pair of opposite
labelled points, as given by the sorted enumeration above. This distance is the margin of the current
round: Points connected by an edge in $G$ represent pairs that are too close together for the current
margin, and we need to compute how many points must be removed from $G$ in order for no edge to
remain in the graph. (However, no points or edges will actually be removed from $G$.) As observed by
Gottlieb et al. [2014a], this task is precisely the problem of bipartite vertex cover. By König's theorem,
the minimum vertex cover problem in bipartite graphs is equivalent to the maximum matching problem,
and a maximum matching in bipartite graphs can be computed in randomized time $O(n^{2.373})$
[Mucha and Sankowski, 2004, Williams, 2012]. So for each candidate margin, we can compute in
$O(n^{2.373})$ time the number of points that must be removed from the current graph $G$ in order to remove
all edges. For $O(n^2)$ possible margins, this amounts to $O(n^{4.373})$ time. Having computed
for each inter-point distance the number of points required to be deleted to achieve this distance, we
choose the distance-number pair which minimizes the bound of Theorem 13. We then remove these
points from $S$, and use the algorithm of Lemma 4 to construct a net satisfying the margin bound.

The runtime improvement in (b) comes from a faster vertex-cover computation. It is well known
that a 2-approximation to vertex cover can be computed (in arbitrary graphs) by a greedy algorithm in
time linear in the graph size $O(|V^+ \cup V^-| + |E|) = O(n^2)$, see e.g. Bar-Yehuda and Even [1981].
This algorithm simply chooses any edge and removes both endpoints, until no edges remain. We
apply this algorithm to our setting: Copy set $S$ to $T$, and iteratively remove from $T$ the next closest
pair of oppositely-labelled points, as given by the sorted enumeration above. For each distance, we
record how many points have been removed from $T$, and this is a 2-approximation for the minimum
number of points that must be removed in order to attain this distance. Having computed for each
inter-point distance the number of points required to be deleted to achieve this distance, we choose the
distance-number pair which minimizes the bound of Theorem 13. We then remove these points from
$S$, and use the algorithm of Lemma 4 to construct a net satisfying the margin bound. The runtime is
dominated by the time required to sort the distances.

For both algorithms, a new point is classified by finding its nearest neighbor in the extracted
net. □
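
The greedy procedure of part (b) can be sketched as follows. This is an illustration under our own assumptions: the function name is hypothetical, a pair is removed only when both of its endpoints are still present (the usual maximal-matching reading of "choose any edge and remove both endpoints"), and the final selection step is omitted because it requires the bound of Theorem 13, which is stated elsewhere in the paper. The function returns, for each candidate distance, the set of sample indices removed so that all surviving opposite-label pairs are farther apart than that distance.

```python
def greedy_margin_tradeoff(sample, rho):
    """sample: list of (point, label) pairs with labels in {-1, +1}.

    Returns a list of (gamma, removed) in increasing order of gamma, where
    removed is a frozenset of indices whose deletion leaves every surviving
    opposite-label pair at distance greater than gamma.
    """
    n = len(sample)
    pairs = sorted(
        (rho(sample[i][0], sample[j][0]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if sample[i][1] != sample[j][1]
    )
    removed, curve = set(), []
    for dist, i, j in pairs:
        if i not in removed and j not in removed:
            removed.update((i, j))  # maximal-matching step: drop both endpoints
        entry = (dist, frozenset(removed))
        if curve and curve[-1][0] == dist:
            curve[-1] = entry  # keep one entry per distinct distance
        else:
            curve.append(entry)
    return curve

rho = lambda x, y: abs(x - y)
sample = [(0, 1), (1, -1), (5, 1), (9, -1)]
curve = greedy_margin_tradeoff(sample, rho)
print([(gamma, sorted(rem)) for gamma, rem in curve])
```

Given such a curve, one would evaluate the generalization bound at each (distance, number removed) pair and keep the minimizer, then extract a net from the surviving points as in Lemma 4.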


6 Generalization guarantees 

In this section, we provide general sample compression bounds, which then will be specialized to
the nearest-neighbor classifier proposed above. Theorem 11 presents a smooth interpolation between
two classic bounds: the consistent case with rate $O(1/n)$, and the agnostic case with rate $O(1/\sqrt{n})$.
Applied to margin-based semimetric sample-compression schemes, this result yields the efficiently
computable and optimizable bound in Theorem 13, which is nearly optimal (as shown in Theorem 8).
Finally, the lower bound in Theorem 14 shows that even under margin assumptions, there exist adversarial
distributions forcing the sample complexity to be exponential in dens.


6.1 Sample compression schemes 


We use the notion of a sample compression scheme in the sense of Graepel et al. [2005], where it is
treated in full rigor. Informally, a learning algorithm maps a sample $S$ of size $n$ to a hypothesis $h_S$.
It is a $d$-sample compression scheme if a sub-sample of size $d$ suffices to produce a hypothesis that
agrees with the labels of all the $n$ points. It is an $\varepsilon$-lossy $d$-sample compression scheme if a sub-sample
of size $d$ suffices to produce a hypothesis that disagrees with the labels of at most $\varepsilon n$ of the $n$ sample
points.

The algorithm need not know $d$ and $\varepsilon$ in advance. We say that the sample $S$ is $(d, \varepsilon)$-compressible
if the algorithm succeeds in finding an $\varepsilon$-lossy $d$-sample compression scheme for this particular sample.
In this case:


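Theorem 10 (Graepel et al. [2005]). For any distribution over $\mathcal{X} \times \{-1, 1\}$, any $n \in \mathbb{N}$ and any
$0 < \delta < 1$, with probability at least $1 - \delta$ over the random sample $S$ of size $n$, the following holds:

(i) If $S$ is $(d, 0)$-compressible, then
$$\mathrm{err}(h_S) \le \frac{1}{n - d}\left((d + 1)\log n + \log\frac{1}{\delta}\right).$$

(ii) If $S$ is $(d, \varepsilon)$-compressible, then
$$\mathrm{err}(h_S) \le \frac{\varepsilon n}{n - d} + \sqrt{\frac{(d + 2)\log n + \log\frac{1}{\delta}}{2(n - d)}}.$$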
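
To get a feel for the two regimes, the bounds of Theorem 10 can be evaluated numerically. The sketch below assumes the bounds read as (1/(n-d))((d+1) log n + log(1/δ)) in the lossless case and εn/(n-d) + sqrt(((d+2) log n + log(1/δ)) / (2(n-d))) in the lossy case, with natural logarithms; the function names and the sample values of n, d, δ are illustrative assumptions.

```python
import math

def bound_lossless(n, d, delta):
    """Theorem 10(i), assumed form: ((d+1) log n + log(1/delta)) / (n - d)."""
    return ((d + 1) * math.log(n) + math.log(1 / delta)) / (n - d)

def bound_lossy(n, d, eps, delta):
    """Theorem 10(ii), assumed form: eps*n/(n-d) + sqrt(slack / (2(n-d)))."""
    slack = (d + 2) * math.log(n) + math.log(1 / delta)
    return eps * n / (n - d) + math.sqrt(slack / (2 * (n - d)))

n, d, delta = 1000, 10, 0.05
print(bound_lossless(n, d, delta))
print(bound_lossy(n, d, 0.0, delta))
```

Even at ε = 0 the lossy bound is considerably larger than the lossless one for these values, which is the abrupt transition between regimes that a smooth interpolation is meant to remove.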


The generalizing power of sample compression was independently discovered by
Littlestone and Warmuth [1986], Devroye et al. [1996], and later elaborated upon by Graepel et al.
[2005]. The bounds above are already quite usable, but they feature an abrupt transition from the
$(\log n)/n$ decay in the lossless ($\varepsilon = 0$) regime to the $\sqrt{(\log n)/n}$ decay in the lossy regime. We
now provide a smooth interpolation between the two (such results are known in the literature as “fast
rates” [Boucheron et al., 2005]):


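Theorem 11. Fix a distribution over $X \times \{-1, 1\}$, an $n \in \mathbb{N}$ and $0 < \delta < 1$. With probability
at least $1 - \delta$ over the random sample $S$ of size $n$, the following holds for all $0 \le \varepsilon \le \frac{1}{2}$: If $S$ is
$(d, \varepsilon)$-compressible, then
\[ \mathrm{err}(h_S) \le \tilde\varepsilon + \frac{2}{3(n-d)}\log\frac{n^{d+2}}{\delta} + \sqrt{\frac{9\tilde\varepsilon(1-\tilde\varepsilon)}{2(n-d)}\log\frac{n^{d+2}}{\delta}} =: Q(d, \varepsilon), \tag{5} \]
where $\tilde\varepsilon = \frac{\varepsilon n}{n-d}$.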
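As a sanity check on the interpolation, the following sketch evaluates $Q(d, \varepsilon)$ numerically. It assumes natural logarithms and expands $\log(n^{d+2}/\delta)$ as $(d+2)\log n + \log(1/\delta)$ to avoid overflow; the function name `fast_rate_bound` is ours.

```python
import math


def fast_rate_bound(n: int, d: int, eps: float, delta: float) -> float:
    """Evaluate Q(d, eps) from Theorem 11, with eps_tilde = eps * n / (n - d)."""
    m = n - d                      # number of points outside the sub-sample
    eps_tilde = eps * n / m
    # log(n^(d+2) / delta), expanded so that n^(d+2) is never formed explicitly
    log_term = (d + 2) * math.log(n) + math.log(1 / delta)
    linear = 2 * log_term / (3 * m)
    root = math.sqrt(9 * eps_tilde * (1 - eps_tilde) * log_term / (2 * m))
    return eps_tilde + linear + root
```

At `eps = 0` the square-root term vanishes and the bound decays like $(\log n)/n$; as `eps` grows towards $1/2$ the square-root term takes over, which is the smooth transition the theorem describes.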
Proof. We closely follow the argument in Graepel et al. [2005, Theorem 2], with the twist that instead
of Hoeffding’s inequality, we use Bernstein’s. The particular form of the latter is due to Dasgupta and Hsu
[2008, Lemma 1]: if $\hat p \sim \mathrm{Bin}(n, p)/n$ and $\delta > 0$, then
\[ p \le \hat p + \frac{2}{3n}\log\frac{1}{\delta} + \sqrt{\frac{9\hat p(1-\hat p)}{2n}\log\frac{1}{\delta}} \tag{6} \]
holds with probability at least $1 - \delta$.

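Now suppose that $S$ is $(d, k/n)$-compressible, as witnessed by some sub-sample $\tilde S \subset S$ of size $d$.
In particular, the hypothesis $h_{\tilde S}$ induced by the sub-sample $\tilde S$ makes $k$ or fewer mistakes on the $n - d$
points in $S \setminus \tilde S$. Substituting $p = \mathrm{err}(h_{\tilde S})$ and
\[ \hat p = \widehat{\mathrm{err}}_{S\setminus\tilde S}(h_{\tilde S}) := \frac{1}{|S\setminus\tilde S|}\sum_{x\in S\setminus\tilde S} \mathbb{1}\{h_{\tilde S}\text{ makes a mistake on } x\} \le \frac{k}{n-d} = \tilde\varepsilon \]
into (6) yields that for fixed $\tilde S$ and random $S \setminus \tilde S$, with probability at least $1 - \delta$,
\[ \mathrm{err}(h_{\tilde S}) \le \widehat{\mathrm{err}}_{S\setminus\tilde S}(h_{\tilde S}) + \frac{2}{3(n-d)}\log\frac{1}{\delta} + \sqrt{\frac{9\tilde\varepsilon(1-\tilde\varepsilon)}{2(n-d)}\log\frac{1}{\delta}}, \tag{7} \]
where we used the monotonicity of $t \mapsto t(1 - t)$ on $[0, \frac{1}{2}]$. To see that (7) follows from (6), note that
when $\tilde S$ of size $d$ is fixed and $S \setminus \tilde S$ is drawn iid $\sim P$, we have $(n-d)\,\widehat{\mathrm{err}}_{S\setminus\tilde S}(h_{\tilde S}) \sim \mathrm{Bin}(n-d, \mathrm{err}(h_{\tilde S}))$.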

To make (7) hold simultaneously for all $\tilde S \subset S$, divide $\delta$ by $n^d$, the number of ways to choose a
(multi)set $\tilde S$ of size $d$. To make the claim hold for all $d \in [n]$ and all $0 \le \varepsilon \le \frac{1}{2}$, stratify (as in
Graepel et al. [2005, Lemma 1]) over the $n^2$ possible choices of $d$ and $k$, which amounts to dividing $\delta$
by an additional factor of $n^2$.

□ 
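The union-bound bookkeeping at the end of the proof can be written out explicitly. The display below is a sketch of the standard stratification argument; $\delta_d$ is our own shorthand for the confidence level used at compression size $d$ and does not appear in the original text.

```latex
% delta_d: confidence level assigned to each fixed sub-sample of size d (our notation)
\delta_d := \frac{\delta}{n^{d}\cdot n^{2}} = \frac{\delta}{n^{d+2}},
\qquad
\log\frac{1}{\delta_d} = \log\frac{n^{d+2}}{\delta},
\qquad
\sum_{(d,k)} n^{d}\,\delta_d \;\le\; n^{2}\cdot\frac{\delta}{n^{2}} \;=\; \delta .
```

The middle identity accounts for the term $\log(n^{d+2}/\delta)$ in the bound $Q(d, \varepsilon)$, and the last inequality shows that the total failure probability over all sub-samples and all pairs $(d, k)$ is at most $\delta$.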


6.2 Margin-based nearest neighbor compression 

We now specialize the general sample compression result of Theorem 11 to our setting, where $h_{S'}$
induced by a sub-sample $S' \subset S$ is given by the 1-NN classifier defined earlier. Any sample $S$
of size $n$ is trivially $(n, 0)$-compressible and $(0, \frac{1}{2})$-compressible: the former is achieved by not
compressing at all, and the latter by a constant predictor. Now $d$ and $\varepsilon$ cannot simultaneously be made
arbitrarily small, and for non-degenerate samples $S$, the bound $Q$ in Theorem 11 will have a nontrivial
minimal value $Q^*$. Theorem 8 shows that computing $Q^*$ is intractable, and the algorithm in Theorem 9
solves a tractable modification of this problem. For $k \in \mathbb{N}$ and $\gamma > 0$, let us say that the sample $S$
is $(k, \gamma)$-separable if it admits a sub-sample $S' \subset S$ such that $|S \setminus S'| \le k$ and $\mathrm{marg}(S') > \gamma$, and
observe that separability implies compressibility:

Lemma 12. If $S$ is $(k, \gamma)$-separable then it is $\left(\mu(S')^{\log_2(2\,\mathrm{rad}(S')/\gamma)},\, k/n\right)$-compressible.

Proof. Suppose $S' \subset S$ is a witness of $(k, \gamma)$-separability. Being pessimistic, we will allow our lossy
sample compression scheme to mislabel all of $S \setminus S'$, but not any of $S'$, giving it a sample error
$\varepsilon \le k/n$. Now by construction, $S'$ is $(0, \gamma)$-separable, and thus a $\gamma$-net $\tilde S \subset S'$ suffices to recover the
correct labels of $S'$ via 1-nearest neighbor. Lemma 2 provides the estimate $|\tilde S| \le \mu(S')^{\log_2(2\,\mathrm{rad}(S')/\gamma)}$,
whence the compression bound. □
These observations culminate in an efficiently optimizable margin-based generalization bound:

Theorem 13. Fix a distribution over $X$, an $n \in \mathbb{N}$ and $0 < \delta < 1$. With probability at least
$1 - \delta$ over the random sample $S$ of size $n$, the following holds for all $0 \le k \le n/2$: If $S$ is $(k, \gamma)$-separable
with witness $S'$, then $\mathrm{err}(h_{S'}) \le Q(d, k/n) =: R(k, \gamma)$, where $Q$ is defined in (5) and
$d = \mu(S')^{\log_2(2\,\mathrm{rad}(S')/\gamma)}$. Furthermore, the minimizer $(k^*, \gamma^*)$ of $R(\cdot, \cdot)$ is efficiently computable.
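To illustrate how $R(k, \gamma)$ trades the number of discarded points against the margin, the sketch below evaluates $R$ on a hand-supplied list of candidates and returns the smallest value. This is not the efficient procedure of Theorem 9: it simply assumes that, for each candidate, the values $k$, $\gamma$, $\mu(S')$ and $\mathrm{rad}(S')$ are already known. The names `Candidate`, `Q`, `R` and `best_candidate` are hypothetical.

```python
import math
from typing import Iterable, NamedTuple


class Candidate(NamedTuple):
    k: int        # number of sample points left outside the witness S'
    gamma: float  # margin parameter of the witness
    mu: float     # density constant mu(S') of the witness (assumed given)
    rad: float    # radius rad(S') of the witness (assumed given)


def Q(n: int, d: float, eps: float, delta: float) -> float:
    """Bound Q(d, eps) of Theorem 11; returns infinity when it is vacuous."""
    m = n - d
    if m <= 0:
        return math.inf
    eps_tilde = eps * n / m
    if eps_tilde >= 1:
        return math.inf
    log_term = (d + 2) * math.log(n) + math.log(1 / delta)
    root = math.sqrt(9 * eps_tilde * (1 - eps_tilde) * log_term / (2 * m))
    return eps_tilde + 2 * log_term / (3 * m) + root


def R(n: int, c: Candidate, delta: float) -> float:
    """R(k, gamma) = Q(d, k/n) with d = mu(S')^(log2(2 rad(S') / gamma))."""
    d = c.mu ** math.log2(2 * c.rad / c.gamma)
    return Q(n, d, c.k / n, delta)


def best_candidate(n: int, candidates: Iterable[Candidate], delta: float) -> Candidate:
    """Pick the candidate with the smallest bound (plain exhaustive search)."""
    return min(candidates, key=lambda c: R(n, c, delta))
```

With `n = 10000`, `delta = 0.05`, `mu = 4` and `rad = 1`, a candidate with `k = 100` and `gamma = 0.5` (so that `d = 16`) gives a bound near 0.05, beating both the lossless candidate with a tiny margin and a candidate that discards a tenth of the sample.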


6.3 Sample complexity lower bound 

The following result shows that even under margin assumptions, a sample of size exponential in dens 
will be required for some distributions. 


Theorem 14. For every semimetric space $(X, \rho)$, there is a distribution $P$ such that $\mathrm{err}(f) = 0$ for
some “target” concept $f : X \to \{-1, 1\}$, yet for any learning algorithm mapping samples $S$ of size $n$
to hypotheses $h_n : X \to \{-1, 1\}$, we have, with high probability,
\[ \mathrm{err}(h_n) = \Omega\!\left( \frac{\sqrt{\mu(X)^{\log_2(2\,\mathrm{rad}(S)/\mathrm{marg}(S))}}}{n} \right). \]




Proof. The definition of the density constant implies the existence of $k = \mu(X) = 2^{\mathrm{dens}(X)}$ nearly
equidistant points $\{x_i\}$, such that $1 \le \rho(x_i, x_j) \le 2$ for all $1 \le i < j \le k$. Following the standard
VC lower bound argument [Blumer et al., 1989, Ehrenfeucht et al., 1989], we construct $P$ by putting
a mass of $1 - 8\varepsilon$ on one of the $k$ points and distributing the remaining mass uniformly over the other
$k - 1$ points. The target $f : \{x_i\} \to \{-1, 1\}$ is drawn uniformly at random from among the $2^k$ choices,
so as to thwart any learning algorithm. For fixed $0 < \varepsilon < \frac{1}{8}$ and $0 < \delta < \frac{1}{100}$, this choice ensures that
a sample of size $\Omega(k/\varepsilon)$ is required in order to produce an $\varepsilon$-accurate hypothesis with $\delta$-confidence.
Inverting for $\varepsilon = \mathrm{err}(h_n)$ will yield the claim, as soon as $k$ and $\ell := \mu(X)^{\log_2(2\,\mathrm{rad}(S)/\mathrm{marg}(S))}$ can
be tied together.

By construction, $0 < \mathrm{marg}(S) \le \mathrm{rad}(S) < \infty$, except for two possible degenerate cases: (a)
$\mathrm{rad}(S) = 0$ and (b) $\mathrm{marg}(S) = \infty$. Case (a) occurs when $S$ consists of a single point, with probability
decaying as $e^{-8\varepsilon n}$. Case (b) occurs when $f$ assigns the same label to all $k$ points, with probability
$2^{-k+1}$. Thus, with overwhelming probability, $\log_2(2\,\mathrm{rad}(S)/\mathrm{marg}(S)) \ge 1$. Since $\mathrm{rad}(S) \le 2$ and
$\mathrm{marg}(S) \ge 1$ by construction, we also have $\log_2(2\,\mathrm{rad}(S)/\mathrm{marg}(S)) \le 2$. It follows that $k \le \ell \le k^2$,
and hence $k \ge \sqrt{\ell}$, which yields the claim. □
A Figures and deferred proofs

Figure accompanying the definition: Sub-sample, margin, and induced 1-NN. 


Figure 1: In this example, the sub-sample $\tilde S \subset S$ is indicated by double circles. It is always the case
that $\mathrm{marg}(\tilde S) \ge \mathrm{marg}(S)$.


Algorithm accompanying Lemma [4] 


Algorithm 1 Brute-force net construction
Require: sample $S$, margin $r$
Ensure: $C$ is an $r$-net for $S$
  $C = \emptyset$
  for $x \in S$ do
    if $\rho(x, C) > r$ then
      $C = C \cup \{x\}$
    end if
  end for
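Algorithm 1 translates directly into a few lines of Python. The sketch below is one possible rendering, under the assumptions that the semimetric is passed in as a function `rho`, that the distance from a point to the empty set is infinite, and that the comparison is strict as printed above; the names `dist_to_set` and `brute_force_net` are ours.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def dist_to_set(x: T, centers: Sequence[T], rho: Callable[[T, T], float]) -> float:
    """rho(x, C): smallest distance from x to a point of C (infinite if C is empty)."""
    return min((rho(x, c) for c in centers), default=math.inf)


def brute_force_net(sample: Sequence[T], r: float, rho: Callable[[T, T], float]) -> List[T]:
    """Greedy net: scan the sample once, keeping each point farther than r from the net so far."""
    centers: List[T] = []
    for x in sample:
        if dist_to_set(x, centers, rho) > r:
            centers.append(x)
    return centers
```

Every sample point ends up within distance `r` of some center, and any two centers are more than `r` apart. Each point is compared against all current centers, so the running time is at most quadratic in the sample size; the triangle inequality is never used.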


Proof of Lemma H] 


Proof. To prove the first claim, let $S$ be a set of points obeying the line metric, i.e. the distance between $s_i, s_j \in S$
is $|i - j|$. Suppose $x$ is at distance $n = |S|$ from $s_i$, and at distance $n + 1$ from all other points of $S$.
Then $s_i$ can be any point of $S$, and cannot be located without inspecting each point. The remaining claim
is the result of Krauthgamer and Lee [2004]. □